Understanding Communication Faults in Parallel Computers

Authors

João Carreira

Diamantino Costa

Henrique Madeira

João Gabriel Silva

Abstract

This paper addresses the evaluation of the dependability properties of distributed memory parallel systems through fault injection. The most popular parallel computers are based on the distributed memory architecture, in which loosely coupled processors communicate by message passing. Fault tolerance increasingly concerns manufacturers and end users of these systems, because the probability of a fault grows with the number of components, and parallel machines can have up to thousands of nodes and complex interconnection media. To validate fault tolerance in these systems, both the processing nodes and the communication subsystem should be taken into account. This paper focuses on the validation of communication subsystems and reports experiments conducted with the CSFI tool (Communication Software Fault Injector) on a commercial parallel machine with no fault-handling mechanisms. Two sets of experiments were performed: one using the original applications, and another using the same applications in conjunction with an application-level CRC mechanism for the messages. The outcome of the experiments was analysed with a focus on those faults that caused the application to generate wrong results without any error being detected. These cases correspond to situations in which it would be virtually impossible to detect that the benchmark output was erroneous. The results show the effectiveness of the CRC as an error detection mechanism and emphasise the need for robust communication protocols in parallel machines in order to achieve confidence in the applications' results. They also suggest that the current quest for performance in the parallel computing industry can only be effective if it is accompanied by dependability.
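The application-level CRC mechanism mentioned in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the CRC polynomial, the message layout, or the fault model used by CSFI, so everything here is an assumption: the sketch uses CRC-32 from the standard `zlib` module, a 4-byte trailer, and a hypothetical single-bit-flip fault (`frame`, `unframe`, and `flip_bit` are illustrative names, not part of CSFI).

```python
import struct
import zlib


def frame(payload: bytes) -> bytes:
    """Append a CRC-32 of the payload as a 4-byte big-endian trailer.

    Assumption: the trailer layout is illustrative, not the paper's format.
    """
    return payload + struct.pack(">I", zlib.crc32(payload))


def unframe(message: bytes) -> bytes:
    """Return the payload, or raise ValueError if the CRC does not match."""
    if len(message) < 4:
        raise ValueError("message too short to carry a CRC")
    payload, trailer = message[:-4], message[-4:]
    (expected,) = struct.unpack(">I", trailer)
    if zlib.crc32(payload) != expected:
        raise ValueError("CRC mismatch: message corrupted in transit")
    return payload


def flip_bit(message: bytes, bit_index: int) -> bytes:
    """Hypothetical fault model: flip a single bit of the message."""
    corrupted = bytearray(message)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


if __name__ == "__main__":
    sent = frame(b"partial result from node 3")
    received = flip_bit(sent, 10)  # simulate a fault on the link
    try:
        unframe(received)
    except ValueError as err:
        print("detected:", err)
```

Without the check in `unframe`, the corrupted bytes would be consumed as valid data, which is the kind of undetected wrong result the experiments focus on. With the check, the receiver can at least signal the error instead of silently computing on bad input.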

Similar resources

Assessing the Effects of Communication Faults on Parallel Applications

This paper addresses the problem of injecting faults into the communication system of disjoint memory parallel computers and presents fault injection results showing that 5% to 30% of the faults injected in the communication subsystem of a commercial parallel computer caused undetected errors that led the application to generate erroneous results. All these cases correspond to situations in w...

Fault-Tolerant Meshes and Hypercubes with Minimal Numbers of Spares

Many parallel computers consist of processors connected in the form of a d-dimensional mesh or hypercube. Two- and three-dimensional meshes have been shown to be efficient in manipulating images and dense matrices, whereas hypercubes have been shown to be well suited to divide-and-conquer algorithms requiring global communication. However, even a single faulty processor or communication link can s...

Parallel Spatial Pyramid Match Kernel Algorithm for Object Recognition using a Cluster of Computers

This paper parallelizes the spatial pyramid match kernel (SPK) implementation. SPK is one of the most widely used kernel methods and, combined with a support vector machine classifier, achieves high accuracy in object recognition. The MATLAB Parallel Computing Toolbox has been used to parallelize SPK. In this implementation, MATLAB Message Passing Interface (MPI) functions and features included in the toolbox help u...

Scalable Fault Tolerance in Multiprocessor Systems

Evolving trends in design and use of computers are resulting in fault-prone systems which may not execute a program to completion. Checkpoint-and-recovery is commonly used to recover from faults and complete parallel programs. Conventional checkpointing-and-recovery can incur high overheads and may be inadequate in the future as faults become frequent. We propose to execute parallel programs de...
